Abstract. In the model-based clustering of networks, blockmodelling may be used to identify 
roles in the network. We identify a special case of the Stochastic Block Model (SBM) where 
we constrain the cluster-cluster interactions such that the density inside the clusters of nodes 
is expected to be greater than the density between clusters. This corresponds to the intuition 
behind community-finding methods, where nodes tend to be clustered together if they link to 
each other. We call this model Stochastic Community Finding (SCF) and present an efficient 
MCMC algorithm which can cluster the nodes, given the network. The algorithm is evaluated 
on synthetic data and is applied to a social network of interactions at a karate club and at a 
monastery, demonstrating how the SCF finds the 'ground truth' clustering where sometimes 
the SBM does not. The SCF is only one possible form of constraint or specialization that 
may be applied to the SBM. In a more supervised context, it may be appropriate to use other 
specializations to guide the SBM. 

Keywords. Model-based clustering, MCMC, Social networks, Community finding, Blockmodelling

1 Introduction 

Clustering typically involves dividing objects into clusters where objects are in some sense
'close to' the other objects in the same cluster. Much research has been done into clustering
points in Euclidean space, where points are put into the same cluster based on a distance
metric between pairs of points. Our data is of a different form: the input is a network.

In network analysis, clustering is usually based on the idea that two nodes in the network are 
'close to' each other if they are linked to each other. This is called community-finding and is the 
main topic of this paper. There are a large number of methods using heuristic algorithms and
non-statistical objective functions [?]. The complexity issues around some such algorithms
are also discussed in the literature [?]. For a thorough review of the broad area of research
into clustering the nodes of a network, see [?].

In the rest of this paper we focus on statistical models and algorithms, as they are relevant 
for our approach. We base our model on the Stochastic Block Model (SBM) of [11]. That model 
is not, by default, a community-finding model. For example, with the famous social network 
known as Zachary's Karate Club the SBM will, if asked for two clusters, divide the nodes into 
one small cluster of high-degree nodes and another cluster containing a large number of
lower-degree nodes. In community-finding, this would be seen as an 'incorrect' result; the members
of the karate club went on to divide themselves into two factions, where most of the friendship 
edges are, unsurprisingly, inside the factions. Community-finding methods are expected to find 
this type of clustering, where the edges tend to be inside clusters. 

Many of the probabilistic models of networks are based on the SBM [?] and therefore they 
do not explicitly tackle community-finding. In this paper, we make a change to the standard 
SBM to require that the blocks corresponding to within-cluster connectivity will be expected 
to be denser than the blocks corresponding to between-cluster connectivity. This will lead to 
an algorithm which, unlike the SBM, will cluster the nodes according to the two factions in the 
karate club, as would be expected in a community-finding algorithm. 

Given a generative model and an observed network, we can check the posterior distribution 
and obtain a clustering, or set of clusterings, which are a good fit for the data. It is typically 
trivial to write MCMC algorithms to sample from the relevant distribution. However, it can be 
challenging to create suitably fast algorithms. We use collapsing along with algorithmic
techniques such as the allocation sampler [10]; a scalable application of these ideas to the standard
SBM is in [8].

In applying these concepts to the SCF, we run into a problem. It does not appear to
be possible to directly integrate out the relevant parameters to give us a fully collapsed model.
However, we will show in this paper how we can work around this and still develop a suitable
Metropolis-Hastings algorithm with the correct transition probabilities without having to resort
to trans-dimensional RJMCMC [?]. This technique is not a typical application of
Metropolis-Hastings and it may have broader applicability, allowing faster algorithms with the
simplicity of collapsing, in models where full explicit collapsing is not possible.



Structure of this paper 

In Section 2 we review the standard SBM of [11], defining the basic notation and models
which will be used throughout. In Section 3 we define our modification to the SBM, which
we call Stochastic Community Finding (SCF). In Section 4 we consider the issue of collapsing;
this is straightforward for the SBM, but not for the SCF. In Section 5 we discuss the
algorithm used in our software¹, which enables us to use Metropolis-Hastings even though we
cannot write down the collapsed posterior mass in closed form. We then proceed to evaluations,
first considering a synthetic network in Section 6 and finally an analysis of Zachary's Karate
Club and Sampson's Monks in Section 7. We close with a discussion of possible future directions
in Section 8.



1 C++ implementation, and datasets used, at https://sites.google.com/site/aaronmcdaid/sbm

2 Stochastic Block Model

In this section, we define the Stochastic Block Model (SBM) of [11] before discussing our
modification in the next section. We restrict our attention in this section to directed unweighted
networks, where edges are simply present or absent. There are many extensions², for example
allowing weighted networks with integer- or real-valued weights [8], [14].

We model a network of $N$ nodes, and the network is represented as an adjacency matrix $x$.
If there is a directed edge from node $i$ to node $j$, we have $x_{ij} = 1$. If they are not connected, we have
$x_{ij} = 0$. By default, we ignore self-loops ($x_{ii}$) and they are simply left out of the formulae.

Given a network $x$, our goal is to identify a clustering $z$. We use a vector $z$ of length $N$,
where $z_i$ is the cluster to which node $i$ is assigned. There are $K$ clusters, $1 \le z_i \le K$.

Given $K$ clusters, there are $K \times K$ blocks, one block for each pair of clusters. There is a
$K \times K$ matrix $\pi$ which records, for each block, the expected density of edge-formation in that
block. In other words, given node $i$ which is in cluster $k = z_i$ and node $j$ which is in cluster
$l = z_j$, the probability of a connection is $\pi_{kl}$,

$$x_{ij} \sim \mathrm{Bernoulli}(\pi_{z_i z_j})$$

In the undirected variant we would have $x_{ij} = x_{ji}$, and only a single draw from the relevant
Bernoulli would be used to set both. The probability of two nodes connecting depends on
the clusters to which the nodes are assigned, but is otherwise independent of the particular nodes;
this is the definition of blockmodelling. The elements of $\pi$ have a prior, $\pi_{kl} \sim \mathrm{Beta}(\beta_1, \beta_2)$. Our
default is to set $\beta_1 = \beta_2 = 1$, which means this prior is a Uniform distribution over (0,1).

$z$ is itself a random variable. There is a vector $\theta$ of length $K$ which represents the probability,
for each cluster, of any node being assigned to that cluster:

$$z_i \sim \mathrm{Multinomial}(1; \theta_1, \theta_2, \dots, \theta_K)$$

$\theta$ is also a random variable and we place a Dirichlet prior on it.

$$\theta \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \dots, \alpha_K) \qquad (1)$$

The parameters of the Dirichlet prior are a choice to be made by the user. It is conventional
to set each of the $\alpha_k$ to the same value, $\alpha_k = \alpha$, and we set $\alpha$ to 1 by default in our
experiments.

Given $N$ and $K$, this is a fully specified generative model for many variables, including
the clustering $z$ and the network $x$. We investigated this model in [8]. An important extension
we introduced there is to place a prior on $K$, $K \sim \mathrm{Poisson}(1)$, which allows us to deal directly with
the number of clusters as a random variable and avoids the need for any separate model selection
criterion. See that paper for a more extended discussion of model selection and validation of
the accuracy of the method in estimating the number of clusters.

$$P(x, \pi, z, \theta, K) = P(K) \times p(z, \theta \mid K) \times p(x, \pi \mid z, K)$$

where we use $P(\dots)$ for probability mass functions, i.e. of discrete quantities such as $z$ or
$K$, and $p(\dots)$ for probability density functions.
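
As an illustration of the generative process just described, the following is a minimal Python sketch, not the authors' implementation. It assumes $K$ is fixed (the Poisson prior on $K$ is omitted), uses 0-based cluster labels, and the function name `sample_sbm` and its arguments are hypothetical.

```python
import random


def sample_sbm(n_nodes, n_clusters, alpha=1.0, beta1=1.0, beta2=1.0, seed=0):
    """Draw (theta, z, pi, x) from the directed, unweighted SBM with K fixed.

    Hypothetical helper for illustration only; labels are 0-based.
    """
    rng = random.Random(seed)

    # theta ~ Dirichlet(alpha, ..., alpha), obtained by normalising Gamma draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clusters)]
    total = sum(gammas)
    theta = [g / total for g in gammas]

    # z_i ~ Multinomial(1; theta): one cluster label per node.
    z = rng.choices(range(n_clusters), weights=theta, k=n_nodes)

    # pi_kl ~ Beta(beta1, beta2), independently for each of the K x K blocks.
    pi = [[rng.betavariate(beta1, beta2) for _ in range(n_clusters)]
          for _ in range(n_clusters)]

    # x_ij ~ Bernoulli(pi[z_i][z_j]) for i != j; self-loops are left out.
    x = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i != j and rng.random() < pi[z[i]][z[j]]:
                x[i][j] = 1
    return theta, z, pi, x
```

Calling `sample_sbm(12, 3)` returns one draw of all four quantities; the inference problem in the rest of the paper runs in the opposite direction, from an observed $x$ back to $z$.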



2 Directed or undirected, unweighted or integer weights and other more complex 'alphabets' to describe an
edge, self-loops modelled or ignored.

3 Stochastic Community Finding

Now that we have defined the SBM, as introduced by [11], we define the modification we are
introducing in the Stochastic Community Finding (SCF) model. In community-finding, as
opposed to block-modelling, we expect that if a pair of nodes are connected then the nodes are
more likely to be clustered together than if they were not connected.

$$P(z_i = z_j \mid x_{ij} = 1) > P(z_i = z_j \mid x_{ij} = 0)$$

Blockmodelling has no such constraint. This is not a hard rule in community-finding, but
it is a useful guide to help define the different goals of community-finding and block-modelling.
An equivalent statement is

$$P(x_{ij} = 1 \mid z_i = z_j) > P(x_{ij} = 1 \mid z_i \neq z_j)$$

This is the formulation we use to define the SCF. We require that all the diagonal entries in
$\pi$ be larger than all the off-diagonal entries of $\pi$, i.e. $\min_m \pi_{mm} > \max_{k \neq l} \pi_{kl}$.
Define a function $v(\pi)$ which returns 1 if $\pi$ satisfies the constraint, and returns 0 if it does not.

$$v(\pi) = \begin{cases} 1 & \text{if } \min_m \pi_{mm} > \max_{k \neq l} \pi_{kl} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$



Under this constraint, the probability density of the SCF model is proportional to $f(x, \pi, z, \theta, K)$,
where

$$f(x, \pi, z, \theta, K) = P_{\mathrm{SBM}}(x, \pi, z, \theta, K) \times v(\pi)$$

and $P_{\mathrm{SBM}}$ is the probability density as defined by the SBM. This function is
essentially identical to that of the SBM except that we have set the density to zero where the constraint
on $\pi$ is not satisfied. A simpler form of the SBM has been investigated [?], where all the
diagonal entries in the blockmodel are taken to be equal to A and all the off-diagonal entries
are equal to e. Their model does not explicitly require that A > e, and hence it is not quite a
community-finding model.
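
The validity function $v(\pi)$ is simple to state in code. The sketch below is illustrative only: the name `scf_valid` is hypothetical, $\pi$ is taken to be a list of lists, and treating $K = 1$ (no off-diagonal blocks) as valid is an assumption that the text does not address.

```python
def scf_valid(pi):
    """Return 1 if every diagonal entry of pi exceeds every off-diagonal entry, else 0."""
    k = len(pi)
    if k < 2:
        # Assumption: with a single cluster there is nothing to violate.
        return 1
    min_diag = min(pi[m][m] for m in range(k))
    max_off = max(pi[a][b] for a in range(k) for b in range(k) if a != b)
    return 1 if min_diag > max_off else 0
```

For example, `scf_valid([[0.9, 0.1], [0.2, 0.8]])` gives 1, while raising the entry 0.2 to 0.85 gives 0 because one between-cluster block is then denser than a within-cluster block.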



4 Collapsing 

Given a network $x$, our goal is to estimate the number of clusters and to find the clustering,
$(K, z)$. In the SBM as investigated by [8], it is straightforward to use collapsing and integrate
out the other variables that we are not directly interested in, such as $\pi$ and $\theta$,

$$\begin{aligned}
P_{\mathrm{SBM}}(x, z, K) &= P_{\mathrm{SBM}}(K) \times P_{\mathrm{SBM}}(z \mid K) \times P_{\mathrm{SBM}}(x \mid z, K) \\
&= P_{\mathrm{SBM}}(K) \times \int p_{\mathrm{SBM}}(z, \theta \mid K) \, d\theta \times \int p_{\mathrm{SBM}}(x, \pi \mid z, K) \, d\pi
\end{aligned}$$

allowing one to create an algorithm which, given $x$, samples $(z, K)$.

But this collapsing does not work in such a straightforward way with the SCF; we cannot,
to our knowledge, write down a closed-form expression for $f(x, z, K)$ where $\pi$ and $\theta$ have been
integrated out. The problem is that it is difficult to integrate out $\pi$ in the SCF due to the
dependence structure between the blocks which is introduced by the constraint in Equation 2.
In the SBM, the elements of $\pi$ are independent of each other. Also, given $z$, the various blocks
within $x$ which correspond to the elements of $\pi$ are independent of each other and dependent
only on a single element of $\pi$.

The models for $K$ and $z$ are the same in the SCF as in the SBM, therefore we will
simply use $P(\dots)$ and $p(\dots)$ for these. But for expressions involving $\pi$ it will make sense to
use $P_{\mathrm{SBM}}(\dots)$ and $f(\dots)$ to distinguish between the (normalized) probability distribution of
the SBM and the (non-normalized) function for the SCF. We attempt to collapse as much as
possible in order to get an expression for $f(x, z, K)$, our desired stationary distribution:

$$\begin{aligned}
f(x, z, K) &= P(K) \times P(z \mid K) \times \int p_{\mathrm{SBM}}(x, \pi \mid z, K) \times v(\pi) \, d\pi \\
&= P(K) \times P(z \mid K) \times \int P_{\mathrm{SBM}}(x \mid z, K) \times p_{\mathrm{SBM}}(\pi \mid x, z, K) \times v(\pi) \, d\pi \\
&= P(K) \times P(z \mid K) \times P_{\mathrm{SBM}}(x \mid z, K) \times \int p_{\mathrm{SBM}}(\pi \mid x, z, K) \times v(\pi) \, d\pi \\
&= P_{\mathrm{SBM}}(x, z, K) \times P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, z, K)
\end{aligned} \qquad (3)$$

The final factor in the final expression, $P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, z, K)$, can be interpreted as the
probability (under the SBM), given $(x, z, K)$, that a draw of $\pi$ will satisfy the constraint; it is
this factor that, to our knowledge, cannot be solved in closed form. The first factor in the final
expression, $P_{\mathrm{SBM}}(x, z, K)$, can be directly taken from [8] as the relevant integration has been
solved as described in the Appendices of that paper. In the following expression, we define $n_k$
to be the number of nodes in cluster $k$, i.e. $n_k$ is a function of $z$. Also, $p_{kl}$ is the number of
pairs of nodes in the block between clusters $k$ and $l$, i.e. $p_{kl} = n_k n_l$, and $y_{kl}$ is the number
of directed edges from nodes in cluster $k$ to nodes in cluster $l$. We also use the Beta function
$B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$.

$$\begin{aligned}
P_{\mathrm{SBM}}(x, z, K) &= P_{\mathrm{SBM}}(K) \times P_{\mathrm{SBM}}(z \mid K) \times P_{\mathrm{SBM}}(x \mid z, K) \\
&= \frac{1}{K!} \frac{1}{e} \times \frac{\Gamma(K\alpha)}{\Gamma(N + K\alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha)}{\Gamma(\alpha)} \times \prod_{k=1}^{K} \prod_{l=1}^{K} \frac{B(y_{kl} + \beta_1, \, p_{kl} - y_{kl} + \beta_2)}{B(\beta_1, \beta_2)}
\end{aligned} \qquad (4)$$

where $\alpha$ is the user-specified parameter to the Dirichlet prior (eq. 1).

In a conventional Metropolis-Hastings algorithm (as in [8]), it is convenient to have closed-form
expressions of the posterior mass at each state in the chain. However, it is not necessary
to have such expressions and we will see in the next section how we can work around this and
develop a Markov chain with the correct transition probabilities for the SCF even though we
do not have a fully closed-form expression for $f(x, z, K)$.
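
To make Equation 4 concrete, the following Python sketch evaluates the collapsed SBM mass in log space for a directed, unweighted network. It is an illustration under stated assumptions rather than the authors' code: the names `log_beta` and `log_p_sbm` are hypothetical, cluster labels are 0-based, and self-pairs are excluded from the diagonal blocks (so the pair count there is $n_k(n_k - 1)$), which is our reading of the statement that self-loops are left out of the formulae.

```python
import math


def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_p_sbm(x, z, n_clusters, alpha=1.0, beta1=1.0, beta2=1.0):
    """log P_SBM(x, z, K) for a directed 0/1 adjacency matrix x (hypothetical helper)."""
    n = len(z)
    counts = [0] * n_clusters
    for zi in z:
        counts[zi] += 1

    # P(K) with K ~ Poisson(1): exp(-1) / K!
    log_p = -1.0 - math.lgamma(n_clusters + 1)

    # P(z | K): Dirichlet-multinomial with symmetric parameter alpha.
    log_p += math.lgamma(n_clusters * alpha) - math.lgamma(n + n_clusters * alpha)
    log_p += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)

    # y[k][l]: number of directed edges from cluster k to cluster l.
    y = [[0] * n_clusters for _ in range(n_clusters)]
    for i in range(n):
        for j in range(n):
            if i != j:
                y[z[i]][z[j]] += x[i][j]

    # P(x | z, K): one Beta-Bernoulli term per block.
    for k in range(n_clusters):
        for l in range(n_clusters):
            # Assumption: self-pairs are removed from diagonal blocks.
            pairs = counts[k] * counts[l] - (counts[k] if k == l else 0)
            log_p += log_beta(y[k][l] + beta1, pairs - y[k][l] + beta2)
            log_p -= log_beta(beta1, beta2)
    return log_p
```

Working in log space avoids underflow for realistic network sizes, and differences of these values are what a Metropolis-Hastings ratio needs.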



5 MCMC algorithm 

In this section, we describe the algorithm we have used to sample from the space of $(z, K)$,
with probability proportional to $f(x, z, K)$ (Equation 3). We have extended the software we
developed in [8] and we direct the reader to that paper for a detailed definition of all the moves.

Algorithm for the SBM

We will first summarize the procedure used in our SBM algorithm, and then describe the change
necessary to turn it into an SCF algorithm. This means our initial goal is to describe an algorithm
whose stationary distribution is proportional to $P_{\mathrm{SBM}}(x, z, K)$. We define a proposal distribution
which, given a current state $s = (z, K)$, will propose a new state $t = (z', K')$.

The proposals are defined by $p$, where $p_{st}$ is the probability that, given the chain is in state
$s = (z, K)$, it will propose to move to state $t = (z', K')$. Clearly, $\sum_t p_{st} = 1$ for all $s$.
Given that a proposal has been made to move from $s$ to $t$, where $s \neq t$, we define an acceptance
probability $a_{st}$. When the proposal is made, we will decide whether to accept or reject the
proposal using a Bernoulli variable with probability $a_{st}$.

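In the SBM, where the desired stationary distribution is proportional to $P_{\mathrm{SBM}}(x, z, K)$, we
were able to use a standard Metropolis-Hastings [7] algorithm with acceptance probability

$$a_{st} = \min\left(1, \frac{p_{ts} \, P_{\mathrm{SBM}}(x, t)}{p_{st} \, P_{\mathrm{SBM}}(x, s)}\right) \qquad (5)$$

where $P_{\mathrm{SBM}}(x, s)$ is defined as $P_{\mathrm{SBM}}(x, z, K)$ and $P_{\mathrm{SBM}}(x, t)$ is defined as $P_{\mathrm{SBM}}(x, z', K')$. For
the SBM, the transition probabilities $q^{\mathrm{SBM}}_{st} = p_{st} a_{st}$ satisfy detailed balance:

$$\frac{q^{\mathrm{SBM}}_{st}}{q^{\mathrm{SBM}}_{ts}} = \frac{p_{st} \, a_{st}}{p_{ts} \, a_{ts}} = \frac{P_{\mathrm{SBM}}(x, (z', K') = t)}{P_{\mathrm{SBM}}(x, (z, K) = s)} \qquad (6)$$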
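
The acceptance probability of Equation 5 is usually evaluated in log space. The following is a small Python sketch of that calculation; the function name `sbm_acceptance` and its argument names are hypothetical, and the inputs are assumed to be the logarithms of the collapsed masses and proposal probabilities.

```python
import math


def sbm_acceptance(log_post_s, log_post_t, log_p_st, log_p_ts):
    """min(1, p_ts * P(x, t) / (p_st * P(x, s))), computed from log inputs.

    log_post_s, log_post_t: log P_SBM(x, s) and log P_SBM(x, t)
    log_p_st, log_p_ts:     log proposal probabilities s -> t and t -> s
    """
    log_ratio = (log_p_ts + log_post_t) - (log_p_st + log_post_s)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
```

As a quick numerical check of detailed balance, take masses 0.2 and 0.05 for $s$ and $t$ and proposal probabilities $p_{st} = 0.3$, $p_{ts} = 0.6$: the forward acceptance is 0.5 and the reverse is 1, so both sides of the balance condition equal 0.03.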

One of the moves is a simple Gibbs update on the position of one node, $z_i$. Node $i$ is
considered for inclusion in each of the $K$ clusters. Another move is called M3, which involves
proposing a reassignment of all the nodes in two randomly-selected clusters. AE is a move which
proposes to split a cluster into two, increasing $K$, or to merge two clusters into one, decreasing
$K$. Together, these moves can visit all states $(z, K)$. For full details see our earlier work [8],
which was based on existing algorithms [10], [?].



Algorithm for the SCF 

But our goal is to develop an algorithm for the SCF. We use the following scheme. First,
make a proposal such as those used in the collapsed SBM algorithm [8]. Second, calculate the
'SBM-acceptance probability' according to Equation 5. Third, make a draw from a Bernoulli
with this probability to decide whether to reject or to (provisionally) accept. If the proposal
is rejected at this stage, there is no further work to be done. But
if the SBM-acceptance probability led to a (provisional) acceptance, then there is one final
step required to decide on rejection or acceptance of the move: we draw from the posterior of
$\pi \mid x, z', K'$, drawing a new $\pi$ conditioning on the (proposed) new values of $z'$ and $K'$ in state $t$; we
fully accept the new state if and only if this $\pi$ satisfies the SCF validity constraint in Equation 2.
This procedure is given in pseudocode in Table 1.
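
The two-stage decision described above can be sketched in a few lines of Python. This is an illustrative skeleton only: `scf_accept`, `draw_pi` and `is_valid` are hypothetical names, the SBM-acceptance probability is assumed to have been computed already, and the posterior draw and the constraint test are passed in as callables so that the sketch stays independent of any particular data structure.

```python
import random


def scf_accept(a_st, draw_pi, is_valid, rng):
    """Two-stage accept/reject step for a proposed state t.

    a_st:     SBM-acceptance probability of the proposal (Equation 5)
    draw_pi:  callable returning one draw of pi given (x, z', K')
    is_valid: callable returning a truthy value if pi satisfies the constraint
    rng:      a random.Random instance
    """
    # Stage 1: ordinary Metropolis-Hastings test under the SBM.
    if rng.random() >= a_st:
        return False
    # Stage 2: a single posterior draw of pi must satisfy the SCF constraint.
    return bool(is_valid(draw_pi()))
```

Because the second stage is only reached after a provisional acceptance, the relatively expensive posterior draw is skipped for every proposal that the SBM test already rejects.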

In this algorithm, a proposal $s \to t$ (with $s \neq t$) will only be accepted if the SBM-acceptance
succeeds and if the draw of $\pi \mid x, z', K'$ satisfies the constraint. Given that the current state is $s$, the
probability of transitioning to another state $t$ is

$$q^{\mathrm{SCF}}_{st} = p_{st} \times a_{st} \times P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, z', K')$$

We will shortly show that this algorithm is correct for drawing from the desired stationary
distribution, but first we describe how to draw $\pi$ from its posterior given $(x, z, K)$.

    Given current state, s = (z, K)
    Propose new state, t = (z', K')
    Calculate SBM-acceptance probability, a_st
    Draw a Bernoulli with probability a_st
    If Failure:
        REJECT
    Else:
        Draw pi | x, z', K' from its posterior
        Test whether pi satisfies the constraint, v(pi) = 1
        If Satisfactory:
            ACCEPT
        Else:
            REJECT

Table 1. Pseudocode describing the acceptance and rejection rules in the SCF algorithm

$\pi$ is a $K \times K$ matrix, one element for each block. In this posterior, as in the prior, these elements are
independent of each other and therefore we proceed by drawing each element of $\pi$ separately.
The prior on each element of $\pi$ is, as described earlier, a $\mathrm{Beta}(\beta_1, \beta_2)$. The data for that block
is the number of edges which appear, $y_{kl}$, and the number of non-edges that are in that block,
$p_{kl} - y_{kl}$. In this case, the posterior is $\mathrm{Beta}(\beta_1 + y_{kl}, \beta_2 + p_{kl} - y_{kl})$. For each element in
$\pi$, this posterior Beta is prepared and one draw is made from each. If the elements on the
diagonal, $\pi_{mm} \sim \mathrm{Beta}(\beta_1 + y_{mm}, \beta_2 + p_{mm} - y_{mm})$, are all greater than those off the diagonal,
$\pi_{kl} \sim \mathrm{Beta}(\beta_1 + y_{kl}, \beta_2 + p_{kl} - y_{kl})$, then the move is accepted.
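
The posterior draw of $\pi$ and the subsequent constraint test might be sketched as follows in Python. All names are hypothetical; `y` and `p` are assumed to be $K \times K$ lists of edge counts and pair counts for the proposed clustering. The last function, which repeats the draw many times to estimate $P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, z, K)$, is included only to illustrate that factor; the algorithm in the text uses a single draw per provisionally accepted proposal.

```python
import random


def draw_pi_posterior(y, p, beta1=1.0, beta2=1.0, rng=random):
    """One draw of each pi_kl from Beta(beta1 + y_kl, beta2 + p_kl - y_kl)."""
    k = len(y)
    return [[rng.betavariate(beta1 + y[a][b], beta2 + p[a][b] - y[a][b])
             for b in range(k)] for a in range(k)]


def satisfies_constraint(pi):
    """True if min diagonal entry > max off-diagonal entry (vacuously true for K = 1)."""
    k = len(pi)
    min_diag = min(pi[m][m] for m in range(k))
    off = [pi[a][b] for a in range(k) for b in range(k) if a != b]
    return not off or min_diag > max(off)


def estimate_validity_probability(y, p, n_draws=2000, seed=0):
    """Monte Carlo estimate of the probability that a posterior draw is valid."""
    rng = random.Random(seed)
    hits = sum(satisfies_constraint(draw_pi_posterior(y, p, rng=rng))
               for _ in range(n_draws))
    return hits / n_draws
```

With two clusters whose within-cluster blocks are full and whose between-cluster blocks are empty, nearly every draw satisfies the constraint; swapping the roles of the blocks drives the estimate towards zero, which is the situation in which SCF proposals are almost always rejected.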

Now, we show that this satisfies detailed balance and that the stationary distribution is
proportional to $f(x, z, K)$. Writing $q^{\mathrm{SCF}}_{st}$ for the probability of transitioning from
state $s$ to state $t$, we reuse Equation 6 in this proof:

$$\begin{aligned}
\frac{q^{\mathrm{SCF}}_{st}}{q^{\mathrm{SCF}}_{ts}} &= \frac{p_{st} \times a_{st} \times P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, (z', K') = t)}{p_{ts} \times a_{ts} \times P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, (z, K) = s)} \\
&= \frac{P_{\mathrm{SBM}}(x, (z', K') = t) \times P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, (z', K') = t)}{P_{\mathrm{SBM}}(x, (z, K) = s) \times P_{\mathrm{SBM}}(v(\pi) = 1 \mid x, (z, K) = s)} \\
&= \frac{f(x, (z', K') = t)}{f(x, (z, K) = s)}
\end{aligned}$$

We also use a method of label-switching which was introduced in [10] and which we used in
[8]. The chain will often visit states which are essentially equivalent to earlier states, but where
the cluster labels have merely been permuted. The procedure involves permuting the labels of
the clusters with the goal of maximizing the similarity of the latest state to all the previous
states. This leads to more easily interpretable results from the chain.

If it were possible to solve Equation 3 exactly, this would probably allow us to have larger
acceptance probabilities and to increase the speed of the algorithm accordingly. Currently, the
algorithm can, in theory, get trapped for some time in a state where the constraint typically fails
for that state and for neighbouring states, making it difficult for the algorithm to climb towards
better states. This is worth some further consideration, and perhaps an algorithm based on an
uncollapsed representation might be best. A naive uncollapsed algorithm, where just one of $z$
or $\pi$ or $\theta$ is updated in a move, would mix very slowly. It may be possible to use moves such as
those in the allocation sampler to propose changes simultaneously to the clustering $z$, to the
density matrix $\pi$ and to the cluster-membership-probability vector $\theta$; such an algorithm may mix
as well as the allocation sampler, and would also make it easier to handle
the constraint efficiently. However, this method would be complex to implement; it may be worthwhile to
investigate this further.

6 Evaluation with synthetic data 

In this section, we evaluate the SCF on a simple synthetic network. We compare the results
with those found by the basic SBM algorithm. If we generate data strictly according to the
generative SCF model, then both algorithms tend to be quite accurate; see our earlier work [8]
for a detailed analysis of the accuracy of the collapsed SBM MCMC algorithm. Therefore, in
order to challenge the algorithms, we instead construct a network where the SBM and SCF get
different results, in order to demonstrate the preference of the SCF for 'community-like' structure.
We consider the undirected network in Figure 1, which has two star-like communities. Each of
these communities has ten nodes, made up of two central nodes and eight peripheral nodes.
Every central node is connected to every peripheral node.
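
The synthetic network is small enough to construct directly. The Python sketch below builds it and counts how many edges fall inside the clusters of a given labelling. It assumes, based on the description of Figure 2, that central nodes link only to the peripheral nodes of their own community and that there are no other edges; the function names and the node numbering are hypothetical.

```python
def build_two_star_network():
    """Return (edges, community, role) for the '2 x 2' network of 20 nodes."""
    # Nodes 0-9 form the first community, nodes 10-19 the second.
    community = [i // 10 for i in range(20)]
    # The first two nodes of each community are central, the other eight peripheral.
    role = ["central" if i % 10 < 2 else "peripheral" for i in range(20)]
    # Assumption: each central node links to the peripheral nodes of its own community only.
    edges = [(u, v)
             for c in range(2)
             for u in (10 * c, 10 * c + 1)
             for v in range(10 * c + 2, 10 * c + 10)]
    return edges, community, role


def edges_within(edges, labels):
    """Number of undirected edges whose two endpoints share a label."""
    return sum(1 for u, v in edges if labels[u] == labels[v])


edges, community, role = build_two_star_network()
both = list(zip(community, role))  # the four-way split by community and role
print(len(edges), edges_within(edges, community),
      edges_within(edges, role), edges_within(edges, both))
```

Under these assumptions all 32 edges lie inside the two communities, whereas a split by role, with or without the community label, places no edge inside any cluster.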

This network has a more heterogeneous degree distribution; this very loosely approximates
the heavy-tailed degree distribution seen in many real-world networks. If we generate data
strictly according to the SBM or SCF the degree distribution is more homogeneous, especially
the distribution of the degrees within a single cluster.

In all the experiments in this section and the following section, we ran the algorithm for 
10,000,000 iterations. By default, we allow the algorithm to select the number of clusters itself 
as the allocation sampler algorithm naturally searches the entire search space. With this network, 
the SCF selects K = 2 and it clusters the nodes into the two star-like communities. The Markov 
Chain spends 97.5% of its iterations in that 'ground truth' state. 

On the other hand, the SBM selects 4 clusters. It subdivides each of the two true communities
into two further clusters, one containing the central nodes and the other containing the
peripheral nodes. We see this in Figure 2, where very few edges are inside the found clusters.




Figure 1. The '2 x 2' network. Two 'roles', peripheral and central. And two communities also, 
left and right. The SCF finds the two communities, and the SBM finds the roles.

Figure 2. The adjacency matrices showing the clusterings found by the SCF (left) and SBM
(middle) on the '2 x 2' network (Figure 1). The SCF has found the communities, with all
edges inside the clusters, as expected. The SBM has divided the nodes according to degree and 
community, but there are no edges within any of the four clusters found by the SBM. On the 
right, we see how the SBM finds only the roles if the number of clusters is fixed at K = 2 in 
advance. Only the SCF has placed all the edges inside the clusters and correctly estimated the 
number of communities. 

Even if we restrict the SBM to consider only K = 2, then it again divides the nodes into central
and peripheral nodes. Regardless of the number of clusters, the SBM finds clusters which do not
contain any of the edges; this is the opposite of what we expect in community finding. 

In networks there may be multiple types of structure that can be detected; the SCF focuses 
on finding the 'community-like' structure, where the clusters are expected to be internally dense. 
In synthetic and empirical networks with a heavy-tailed degree distribution the SBM may have a 
tendency to cluster nodes according to their degree, or other structural roles, and not according 
to community structure. 

7 Empirical networks 

In this section, we apply the SCF to two well-known social networks.

Sampson's Monks

Sampson [?] gathered data on novices at a monastery³. There are 18 novices in the network
and a pair are linked if they reported a positive friendship between them, giving us an undirected 
network. There were factions within the group, which Sampson labelled Loyal Opposition, Young 
Turks and Outcasts. 

We ran the SCF method on this dataset for 10,000,000 iterations. It estimated the number of 
clusters at 3, in 88.5% of the iterations. For 69% of the iterations, the clustering was exactly
equal to the factions reported by Sampson. The network and adjacency matrix are shown in
Figure 3. We also ran the data through the SBM. It found very similar results. This suggests
that if the community structure is strong, then either algorithm can detect it. However, the 
SBM is slightly less accurate and only 55% of the iterations involve K = 3. This suggests that 
there is other structure, perhaps the high-degree versus low-degree structure, that is trying to 
assert itself. 

3 Sampson's monk data as an R package: http://rss.acs.unt.edu/Rdoc/library/LLN/html/Monks.html

Zachary's Karate club



Now, we apply the SCF to a network of interactions at a karate club [15], again demonstrating
the ability of the SCF to detect community structure where the SBM focuses on other types of
structure.

The members of the karate club were asked about their social interactions with other members,
focusing on interactions outside of the lessons and tournaments. This gives us a network of
34 members and 78 interactions. The interaction data is weighted, according to the number of
distinct social interaction types reported by the members; a larger number is taken to indicate a
stronger friendship⁴. After the survey was taken, the club split into two factions over a dispute
about the cost of the lessons. The network is visualized in Figure 4.

This network has weighted edges and hence we apply our SCF constraint (eq. 2) to the
weighted variant of the SBM. The edges have a Poisson weight, and the rate of the Poisson, $\pi_{kl}$,
is different for each block and comes from a Gamma prior; full details of this edge model are
in the Appendix of our earlier work [8].

If we fix the number of clusters at K = 2, then the SCF will correctly cluster the nodes
according to the split that occurred in the club; the chain will spend 85.5% of its iterations in that
state. This contrasts with the SBM, which instead clusters the nodes into 9 high-degree and
25 low-degree nodes, a clustering which is quite different from the factional split; this SBM
clustering is in Figure 5. The high-degree nodes include the leaders of each faction.

Unfortunately, unlike our earlier networks, the SCF does not correctly estimate the number 
of clusters within the karate club. We had to specify that K = 2 in order to find the correct 
clustering, whereas our MCMC algorithm estimates K = 5. The issue of model selection within 
this model may be worth considering further. 



4 Weighted karate club network: http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/zachary.dat

Figure 4. The karate club network of [15]. The width of an edge indicates the strength of the relationship
by counting the number of distinct interaction types recorded between the two members. The 
club split in two after the survey and the colour of the node records the split. On the right is the 
adjacency matrix of this network. The rows and columns have been ordered according to which 
faction the node is in; most of the edge weight is on the top-left and bottom-right, as would be 
expected in good community structure. The SCF algorithm finds this clustering when K = 2. 




8 Conclusion 

Community finding is popular in the social science literature, but many statistical models 
are defined for block-modelling, not explicitly for community-finding. In order to investigate 
community-finding, we have introduced a constraint that the density inside clusters be larger 
than the density between pairs of clusters. We have extended an existing block-modelling
method, which was based on the Stochastic Block Model (SBM), to take account of this constraint.
We have evaluated the method and shown that it can detect community structure where the SBM
cannot.